The data comes from Kaggle, the largest online platform for machine learning enthusiasts, hosting datasets and competitions around data science. More precisely, the dataset was part of season 1, episode 5 of SLICED, a data science competition streamed on Twitch by Nick Wan and Meg Risdal.
This dataset contains the prices of Airbnb listings in New York City. The purpose of this post is to demonstrate how the stacks package blends individual machine learning models into a linear combination, often increasing final model performance.
First, I load the data. The first file is used for training, whereas the holdout will only be used for submitting out-of-sample predictions, as it does not contain the target variable. The training data is fairly large, holding information on 34,226 listings with 15 predictors and, as the response variable, the listing price per night.
names(data)
## [1] "id" "name"
## [3] "host_id" "host_name"
## [5] "neighbourhood_group" "neighbourhood"
## [7] "latitude" "longitude"
## [9] "room_type" "price"
## [11] "minimum_nights" "number_of_reviews"
## [13] "last_review" "reviews_per_month"
## [15] "calculated_host_listings_count" "availability_365"
There is considerable missing data in the reviews per month (double) and last review (date) columns and little missing data in the host name (character) and listing name (character) columns. All missing values can likely be imputed, provided they are missing at random.
colMeans(is.na(data)) %>%
tidy() %>%
rename(pct = x) %>%
mutate(names = fct_reorder(names, pct)) %>%
# filter(pct > 0) %>%
ggplot(aes(pct, names)) +
geom_col(fill = "midnightblue") +
labs(title = "Missing Data In Variables",
subtitle = "Percent missingness calculated for each column",
y = NULL,
x = NULL) +
scale_x_continuous(labels = scales::percent_format()) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"))
The target variable distributions for missing and non-missing values in the two columns with considerable missingness look comparable and do not exhibit obvious differences. Therefore, imputing these missing values should not pose a problem, even though this conclusion has to be taken with a grain of salt: with observed data alone, it is impossible to determine for certain whether a variable is missing at random; it can merely be assumed.
data %>%
transmute(reviews_per_month = ifelse(is.na(reviews_per_month),
"missing",
"not missing"),
last_review = ifelse(is.na(last_review),
"missing",
"not missing"),
price) %>%
pivot_longer(-c(price), names_to = "variable", values_to = "state") %>%
ggplot(aes(variable, price, fill = state)) +
geom_boxplot(outlier.alpha = 0.2) +
labs(y = "Price",
x = NULL,
fill = "Status:") +
ggsci::scale_fill_locuszoom() +
scale_y_log10(labels = scales::dollar_format()) +
theme_bw()
The next step is to walk through the available predictors and understand their relation to the target variable. Below, every variable is briefly examined, enabling a better understanding of the complete training data.
If you are just interested in how to build a model stack with stacks, feel free to skip this part and continue at Building And Training The Stacked Model.
The unique identifier column for each listing is a random number and should not hold any predictive power.
cor(data$price, data$id)
## [1] 0.009927947
The name variable contains the title of the listing, which will be useful for tokenisation at a later stage.
data %>% count(name, sort = T)
The host ID is also a random number and, in theory, should not contain any predictive power. However, with multiple listings per host and some hosts specialising in, for instance, luxury apartments or affordable housing, the variable might still hold additional insights. Therefore, it will be included in the recipe.
cor(data$price, data$host_id)
## [1] 0.01191348
The host name variable contains the first names of the hosts. There is likely no deterministic component to them, except for known hosts like Blueground, who specialise in renting out furnished apartments for longer periods. Therefore, it will not be used as a predictive variable, as the relation might be too fragile for the out-of-sample application of the model.
data %>%
group_by(host_name) %>%
summarise(n = n(),
mean_price = mean(price)) %>%
filter(n > 100) %>%
mutate(host_name = paste0(host_name, " (N=", n, ")") %>%
fct_reorder(mean_price)) %>%
ggplot(aes(mean_price, host_name, size = n)) +
geom_point(colour = "midnightblue") +
labs(title = "Mean NYC Airbnb Prices By Host Names",
subtitle = "Only names with observation count >100 are shown",
x = "Mean Price Per Night",
y = NULL,
size = "Observation Count:") +
scale_x_continuous(labels = scales::dollar_format()) +
scale_size_continuous(labels = scales::comma_format(),
range = c(2,5)) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"),
legend.position = "bottom")
There are only five neighbourhood groups. However, the price distributions differ considerably between them, so they will be particularly useful as nominal predictors.
data %>%
ggplot(aes(neighbourhood_group %>% fct_reorder(price),
price,
fill = neighbourhood_group)) +
geom_boxplot(show.legend = F, outlier.alpha = 0.5, alpha = 0.6) +
labs(title = "NYC Airbnb Price Distribution By Neighbourhood Group",
subtitle = NULL,
x = NULL,
y = "Price Per Night") +
scale_y_log10(labels = scales::dollar_format()) +
ggsci::scale_fill_locuszoom() +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"),
legend.position = "bottom")
data %>%
group_by(neighbourhood_group) %>%
summarise(n = n(),
mean_price = mean(price)) %>%
mutate(neighbourhood_group = paste0(neighbourhood_group, " (N=", n, ")") %>%
fct_reorder(mean_price)) %>%
ggplot(aes(mean_price, neighbourhood_group, size = n)) +
geom_point(colour = "midnightblue") +
labs(title = "Mean NYC Airbnb Prices By Neighbourhood Group",
subtitle = NULL,
x = "Mean Price Per Night",
y = NULL,
size = "Observation Count:") +
scale_x_continuous(labels = scales::dollar_format()) +
scale_size_continuous(labels = scales::comma_format(),
range = c(2,5)) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"),
legend.position = "bottom")
Neighbourhoods, similarly to neighbourhood groups, are useful indicators for prices of Airbnbs in NYC. However, due to the high cardinality, they will have to be lumped together in order not to exceed the memory limits of my machine.
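The lumping idea can be sketched with forcats before it is handled inside the recipe with step_other(), which ensures the holdout set receives the same grouping (a sketch on the training data; the 0.1% threshold mirrors the recipe):

```r
library(dplyr)
library(forcats)

# Neighbourhoods rarer than 0.1% of listings are collapsed into a
# single "Other" level, drastically reducing cardinality
data %>%
  mutate(neighbourhood = fct_lump_prop(neighbourhood, prop = 0.001)) %>%
  count(neighbourhood, sort = TRUE)
```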
data %>%
group_by(neighbourhood) %>%
summarise(n = n(),
mean_price = mean(price)) %>%
mutate(neighbourhood = paste0(neighbourhood, " (N=", n, ")") %>%
fct_reorder(mean_price)) %>%
slice_max(order_by = n, n = 25) %>%
ggplot(aes(mean_price, neighbourhood, size = n)) +
geom_point(colour = "midnightblue") +
labs(title = "Mean NYC Airbnb Prices By Neighbourhood",
subtitle = "Only the top 25 most frequent neighbourhoods are shown.",
x = "Mean Price Per Night",
y = NULL,
size = "Observation Count:") +
scale_x_continuous(labels = scales::dollar_format()) +
scale_size_continuous(labels = scales::comma_format(),
range = c(2,5)) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"),
legend.position = "bottom")
Latitude and longitude give information about the location of the listings. This enables me to make a map. Again, it looks like Manhattan is the most expensive place to rent an Airbnb, which will be useful for the model.
data %>%
ggplot(aes(longitude, latitude, z = price)) +
stat_summary_hex(bins = 100, fun = mean) +
labs(title = "Hexagon Plot Of Log NYC Airbnb Prices",
subtitle = "Longitude and latitude are binned into 100 hexagons.",
x = NULL,
y = NULL,
fill = NULL) +
coord_equal() +
scale_alpha_continuous(range = c(0, 1), trans = "log") +
scale_fill_gradient(low = "azure3", high = "red") +
theme_void() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"),
legend.position = "none")
The target is log-normally distributed, hence a log transform would be appropriate for a linear model, for instance. Since SLICED uses RMSLE as the evaluation metric, the target variable will be log-transformed for the XGBoost model as well, so that the built-in RMSE metric of the tidymodels framework corresponds to RMSLE on the original scale (just a decision of convenience in this case).
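The equivalence is easy to verify on toy numbers: RMSE computed on log(price + 1) is exactly RMSLE on the original price scale (hypothetical values below):

```r
# Hypothetical truth and predictions on the original price scale
truth <- c(100, 250, 80)
pred  <- c(120, 200, 90)

# RMSLE on the original scale ...
rmsle <- sqrt(mean((log(pred + 1) - log(truth + 1))^2))

# ... equals plain RMSE on the log(x + 1) transformed values
rmse_log <- sqrt(mean((log1p(pred) - log1p(truth))^2))

all.equal(rmsle, rmse_log)  # TRUE
```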
data %>%
ggplot(aes(price)) +
geom_histogram(fill = "midnightblue", colour = "white") +
labs(title = "Distribution Of NYC Airbnb Listing Prices",
subtitle = NULL,
y = "Frequency",
x = "Price Per Night") +
scale_x_log10(labels = scales::dollar_format()) +
scale_y_continuous(labels = scales::comma_format()) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"),
legend.position = "none")
Creating 100 bins for minimum nights, taking the average price per night by percentile and plotting the correlation in a scatter plot reveals that there might be a positive effect of minimum nights on listing price. This likely has to do with apartments being more expensive than private rooms and more likely to come with a contractual minimum stay.
data %>%
mutate(minimum_nights_ntile = ntile(minimum_nights, n = 100)) %>%
group_by(minimum_nights_ntile) %>%
summarise(mean_price = mean(price)) %>%
ggplot(aes(minimum_nights_ntile, mean_price)) +
geom_point() +
geom_smooth(method = "loess") +
labs(title = "Correlation Of Minimum Nights And NYC Airbnb Price",
subtitle = "Minimum nights have been binned into 100 percentiles and averaged.",
y = "Mean Price Per Night",
x = "Minimum Nights Percentiles") +
scale_x_continuous(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::dollar_format()) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"),
legend.position = "none")
The number of reviews shows a slightly negative effect.
data %>%
mutate(ntile = ntile(number_of_reviews, n = 100)) %>%
group_by(ntile) %>%
summarise(mean_price = mean(price)) %>%
ggplot(aes(ntile, mean_price)) +
geom_point() +
geom_smooth(method = "loess") +
labs(title = "Correlation Of Number Of Reviews And NYC Airbnb Price",
subtitle = "Number of reviews have been binned into 100 percentiles and averaged.",
y = "Mean Price Per Night",
x = "Number of Reviews Percentiles") +
scale_x_continuous(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::dollar_format()) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"),
legend.position = "none")
This variable will probably be more valuable to the model if I transform it into days since last review, measured from today. This has to be done in the recipe so that it is applied equally to the holdout set.
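The transformation itself is a one-liner; in the actual recipe it is wrapped in step_mutate() so the holdout set is transformed identically (shown here as a sketch on the training data):

```r
library(dplyr)

# Convert the last_review date into a numeric "days since last review",
# measured from the time the code is run
data %>%
  mutate(days_since_last_review =
           as.numeric(difftime(Sys.time(), last_review, units = "days")))
```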
There does not seem to be any relation between days since last review and price. If anything, more recently reviewed Airbnbs have slightly higher prices, but the confidence bands suggest the effect is not statistically significant.
data %>%
mutate(ntile = ntile(last_review, n = 100)) %>%
group_by(ntile) %>%
summarise(mean_price = mean(price)) %>%
ggplot(aes(ntile, mean_price)) +
geom_point() +
geom_smooth(method = "loess") +
labs(title = "Correlation Of Days Since Last Review And NYC Airbnb Price",
subtitle = "Days since last review have been binned into 100 percentiles and averaged.",
y = "Mean Price Per Night",
x = "Days Since Last Review Percentiles") +
scale_x_continuous(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::dollar_format()) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"),
legend.position = "none")
There seems to be a slightly negative relation between reviews per month and price.
data %>%
mutate(ntile = ntile(reviews_per_month, n = 100)) %>%
group_by(ntile) %>%
summarise(mean_price = mean(price)) %>%
ggplot(aes(ntile, mean_price)) +
geom_point() +
geom_smooth(method = "loess") +
labs(title = "Correlation Of Reviews Per Month And NYC Airbnb Price",
subtitle = "Reviews per month have been binned into 100 percentiles and averaged.",
y = "Mean Price Per Night",
x = "Reviews Per Month Percentiles") +
scale_x_continuous(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::dollar_format()) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"),
legend.position = "none")
It seems that hosts with the highest number of listings, that is, professional hosts, have higher room prices on average, even though that might be driven by non-significant outliers.
The binned averages show a non-linear downward trend with a sudden increase in the highest ten percent for both entire homes and private rooms; for shared rooms, this trend is reversed. This means that hosts with exceptionally many listings likely have expensive homes or private rooms, but cheaper shared rooms. I wonder whether the first two reflect a dominant market position of larger firms, or rather the quality of the offerings. For the shared rooms, it might have to do with subsidised housing or larger firms specialising in affordable rooms and owning entire buildings of them. Either way, very interesting.
data %>%
mutate(ntile = ntile(calculated_host_listings_count, n = 100)) %>%
group_by(ntile, room_type) %>%
summarise(mean_price = mean(price)) %>%
ggplot(aes(ntile, mean_price, colour = room_type)) +
geom_point() +
geom_smooth(method = "loess", se = F) +
labs(title = "Correlation Of Listings Per Host And NYC Airbnb Price",
subtitle = "Listings per host have been binned into 100 percentiles and averaged.",
y = "Mean Price Per Night",
x = "Listings Per Host (Percentiles)",
colour = "Room Type") +
scale_x_continuous(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::dollar_format()) +
ggsci::scale_colour_jama() +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"),
legend.position = "bottom")
Again, it is very interesting to observe an interaction between room type and another variable. For private rooms, availability throughout the year has virtually no effect on price, whereas more available shared rooms are usually cheaper and more available entire apartments and houses are the most expensive. I believe the latter might be due to location, as the most touristy places are available all year round, attracting people with concentrated spending power over shorter periods. These offerings also have to compensate for the time they sit empty when no tourists are around.
Anyway, this being speculation, let’s get into the thick of it and continue with correlation analysis for the numeric predictors.
data %>%
mutate(ntile = ntile(availability_365, n = 100)) %>%
group_by(ntile, room_type) %>%
summarise(mean_price = mean(price)) %>%
ggplot(aes(ntile, mean_price, colour = room_type)) +
geom_point() +
geom_smooth(method = "loess", se = F) +
labs(title = "Correlation Of Availability And NYC Airbnb Price",
subtitle = "Availability per year has been binned into 100 percentiles and averaged.",
y = "Mean Price Per Night",
x = "Availability Per Year (Percentiles)",
colour = "Room Type") +
scale_x_continuous(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::dollar_format()) +
ggsci::scale_colour_jama() +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"),
legend.position = "bottom")
Let’s take a look at the correlation matrix from the GGally package to gauge relations of numeric predictors with the target variable.
data %>%
mutate(last_review = difftime(Sys.time(), last_review) %>% as.numeric()) %>%
select_if(is.numeric) %>%
select(-price, price) %>%
drop_na() %>%
ggcorr(label = T, label_size = 3)
It does not look like the numeric variables are good predictors of the price in this setting. However, since correlation only measures linear association, a non-linear machine learning model might well uncover relationships that are hidden here.
Proceeding to look at variance inflation factors:
data %>%
select_if(is.numeric) %>%
lm(formula = price ~ .) %>%
vif()
## id host_id
## 2.164903 1.654317
## latitude longitude
## 1.011668 1.067277
## minimum_nights number_of_reviews
## 1.039035 2.322328
## reviews_per_month calculated_host_listings_count
## 2.297050 1.082071
## availability_365
## 1.141782
VIFs are below 10 for all numeric variables, so there is no multicollinearity problem that would have to be dealt with, or at least mentioned.
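As a reminder of what these numbers mean: the VIF of a predictor is 1/(1 − R²) from regressing that predictor on all the other predictors, so a VIF of 10 corresponds to an R² of 0.9. Computed by hand for one variable (a sketch using the same numeric columns as above):

```r
library(dplyr)

# VIF for minimum_nights by hand: regress it on the remaining numeric
# predictors and compute 1 / (1 - R^2)
num_data <- data %>%
  select_if(is.numeric) %>%
  select(-price)

r2 <- summary(lm(minimum_nights ~ ., data = num_data))$r.squared
1 / (1 - r2)  # should roughly match the vif() output above (~1.04)
```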
With the ideas gathered from the exploratory data analysis, I can now proceed with building the model.
First, the data is split into training and testing sets. Additionally, three-fold cross-validation is employed for reliable calculation of performance metrics while bearing time efficiency in mind.
dt_split <- data %>%
mutate(price = log(price + 1)) %>%
initial_split(strata = "price")
dt_train <- training(dt_split)
dt_test <- testing(dt_split)
folds <- vfold_cv(dt_train, v = 3, strata = "price")
The recipe in the tidymodels framework makes it very straightforward to include all feature engineering in one step, preventing data leakage from the test set and uniformly applying the same steps to the holdout in the final fit.
gb_rec <- recipe(price ~ .,
data = dt_train) %>%
step_rm(host_name) %>%
step_novel(all_nominal_predictors()) %>%
step_tokenize(name) %>%
step_stopwords(name) %>%
step_tokenfilter(name, max_tokens = 40) %>%
step_tf(name) %>%
step_mutate(last_review = difftime(Sys.time(), last_review) %>%
as.numeric()) %>%
step_impute_median(all_numeric_predictors()) %>%
step_other(neighbourhood, threshold = 0.001) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE)
en_rec <- recipe(price ~ .,
data = dt_train) %>%
step_rm(host_name) %>%
step_novel(all_nominal_predictors()) %>%
step_tokenize(name) %>%
step_stopwords(name) %>%
step_tokenfilter(name, max_tokens = 40) %>%
step_tf(name) %>%
step_mutate(last_review = difftime(Sys.time(), last_review) %>%
as.numeric()) %>%
step_impute_median(all_numeric_predictors()) %>%
step_other(neighbourhood, threshold = 0.001) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE)
Setting up the model specifications with tuning options for hyperparameters:
gb_spec <-
boost_tree(
trees = 1000,
tree_depth = tune(),
min_n = tune(),
loss_reduction = tune(),
sample_size = tune(),
mtry = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost", importance = "impurity") %>%
set_mode("regression")
en_spec <- linear_reg(penalty = tune(),
mixture = tune()) %>%
set_engine("glmnet")
In the engine specification, the method for computing variable importance can be set; here it is based on impurity. Proceeding with setting up the workflows:
gb_wflow <-
workflow() %>%
add_recipe(gb_rec) %>%
add_model(gb_spec)
en_wflow <-
workflow() %>%
add_recipe(en_rec) %>%
add_model(en_spec)
Setting up a space-filling design for time-efficient hyperparameter tuning:
gb_grid <-
grid_latin_hypercube(
tree_depth(),
min_n(),
loss_reduction(),
sample_size = sample_prop(),
finalize(mtry(), dt_train),
learn_rate(),
size = 50
)
en_grid <-
grid_latin_hypercube(
penalty(),
mixture(),
size = 100
)
Now, the hyperparameters can be tuned with parallel computing in order to utilise more of the available computing power.
# Gradient Boosting
start_time = Sys.time()
unregister_dopar <- function() {
env <- foreach:::.foreachGlobals
rm(list=ls(name=env), pos=env)
}
cl <- makePSOCKcluster(6)
registerDoParallel(cl)
gb_tune <- tune_grid(object = gb_wflow,
resamples = folds,
grid = gb_grid,
control = control_grid(save_pred = TRUE,
save_workflow = TRUE))
stopCluster(cl)
unregister_dopar()
end_time = Sys.time()
end_time - start_time
## Time difference of 14.06302 mins
# Elastic Net
start_time = Sys.time()
unregister_dopar <- function() {
env <- foreach:::.foreachGlobals
rm(list=ls(name=env), pos=env)
}
cl <- makePSOCKcluster(6)
registerDoParallel(cl)
en_tune <- tune_grid(object = en_wflow,
resamples = folds,
grid = en_grid,
control = control_grid(save_pred = TRUE,
save_workflow = TRUE))
stopCluster(cl)
unregister_dopar()
end_time = Sys.time()
end_time - start_time
## Time difference of 1.487113 mins
Looking at the tuning results reveals that the model captures strong signal in the predictors, as the \(R^2\) is fairly high.
gb_tune %>%
show_best(metric = "rsq") %>%
transmute(model = "XGBoost", .metric, mean, n, std_err)
en_tune %>%
show_best(metric = "rsq") %>%
transmute(model = "Elastic Net", .metric, mean, n, std_err)
Before creating a stacked model, let’s take a look at the variable importance within both individual models.
gb_final_wflow <- gb_wflow %>%
finalize_workflow(select_best(gb_tune, metric = "rmse"))
gb_final_fit <- gb_final_wflow %>%
last_fit(dt_split)
gb_final_fit %>%
pluck(".workflow", 1) %>%
extract_fit_parsnip() %>%
vi() %>%
slice_max(order_by = Importance, n = 20) %>%
ggplot(aes(Importance, reorder(Variable, Importance))) +
geom_col(fill = "midnightblue", colour = "white") +
labs(title = "Variable Importance",
subtitle = "Only the most important predictors are shown.",
y = "Predictor",
x = "Relative Variable Importance") +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"))
For the XGBoost model, the type of room as well as the location, especially information about Manhattan, was important for predicting price. This shows how important the inclusion of nominal predictors was for model performance.
en_final_wflow <- en_wflow %>%
finalize_workflow(select_best(en_tune, metric = "rmse"))
en_final_fit <- en_final_wflow %>%
last_fit(dt_split)
en_final_fit %>%
pluck(".workflow", 1) %>%
extract_fit_parsnip() %>%
vi() %>%
slice_max(order_by = Importance, n = 30) %>%
mutate(Importance = ifelse(Sign == "NEG", Importance * -1, Importance)) %>%
ggplot(aes(Importance, reorder(Variable, Importance),
fill = Sign)) +
geom_col(colour = "white") +
labs(title = "Variable Importance",
subtitle = "Only the most important predictors are shown.",
y = "Predictor",
x = "Coefficient") +
ggsci::scale_fill_jama() +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"),
legend.position = "bottom")
For the elastic net, interestingly, the neighbourhoods were decisive. Only the 30 most important variables are shown, and most of them carry information on geographic location from the neighbourhood variable. Furthermore, most of the important variables affect price negatively.
With both these individual tuning results, a blended (“stacked”) model can easily be built with the stacks package.
blended_gb_en <- stacks() %>%
add_candidates(gb_tune) %>%
add_candidates(en_tune) %>%
blend_predictions()
blended_gb_en
## # A tibble: 5 × 3
## member type weight
## <chr> <chr> <dbl>
## 1 gb_tune_1_44 boost_tree 7426.
## 2 gb_tune_1_31 boost_tree 991.
## 3 gb_tune_1_06 boost_tree 0.862
## 4 gb_tune_1_35 boost_tree 0.0756
## 5 gb_tune_1_24 boost_tree 0.0210
The stacks package creates a model that additively blends the predictions of the separately trained candidates. The optimisation for this is built into the package and produces the output above. Interestingly, no elastic net candidate was chosen; instead, a linear combination of XGBoost models is selected. To proceed with predicting on the final holdout set, the stack members are now fitted on the training data.
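Conceptually, blending boils down to a regularised, non-negative linear regression of the truth on the members' out-of-fold predictions, with the coefficients becoming the stacking weights. A minimal hand-rolled sketch with hypothetical numbers (stacks itself uses glmnet with a non-negativity constraint and more machinery):

```r
library(tibble)

# Hypothetical out-of-fold predictions of two members on the log scale
oof <- tibble(
  truth = c(4.6, 5.0, 5.3, 4.8),
  m1    = c(4.5, 5.1, 5.2, 4.9),
  m2    = c(4.8, 4.9, 5.4, 4.7)
)

# Plain least squares stands in for the penalised, constrained fit
# that stacks performs internally
w <- coef(lm(truth ~ m1 + m2, data = oof))

# The blended prediction is the weighted combination of member predictions
blended <- w[1] + w[2] * oof$m1 + w[3] * oof$m2
```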
blended_gb_en <- blended_gb_en %>%
fit_members()
Using the fitted model to predict and evaluate on the test set:
blended_gb_en %>%
predict(dt_test) %>%
bind_cols(dt_test %>% select(price)) %>%
rsq(.pred, truth = price) %>%
mutate(model = "Stacked Model")
Success! The blended model stack attained an \(R^2\) slightly higher than the individual XGBoost model on the test data set. This goes to show how stacking individual models can give the final predictions an additional edge.
gb_final_fit %>%
extract_workflow() %>%
predict(dt_test) %>%
bind_cols(dt_test %>% select(price)) %>%
rsq(.pred, truth = price) %>%
mutate(model = "XGBoost Model")
blended_gb_en %>%
predict(dt_test) %>%
bind_cols(dt_test %>% select(price)) %>%
ggplot(aes(expm1(price), expm1(.pred))) +
geom_point(colour = "midnightblue", alpha = 0.4) +
geom_abline(lty = "dashed", colour = "grey50") +
scale_x_log10(labels = scales::dollar_format()) +
scale_y_log10(labels = scales::dollar_format()) +
labs(title = "Out-Of-Sample Fit Of The Blended Model",
subtitle = NULL,
y = "Prediction",
x = "Truth") +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"),
legend.position = "bottom")
With the trained model stack, I can now make predictions for the holdout dataset, which will be submitted to the leaderboard on Kaggle.
blended_gb_en %>%
predict(holdout) %>%
bind_cols(holdout %>% select(id)) %>%
transmute(id, price = expm1(.pred))
This model ranks 5th of 30 on the SLICED competition leaderboard, which, given the absence of new features and of extensive tuning in this post, I believe speaks volumes about the power of XGBoost and stacking in competition settings.
In conclusion, the models fitted the data fairly well, even though the predictions are not highly impressive from an absolute perspective. To make them more accurate, more information at the listing level would have been useful, for instance room size, capacity, proxies for luxuriousness, details on reviews, and information on amenities.
I hope this post has been interesting to you. In case of constructive feedback or if you want to exchange about this or a related topic, feel free to reach out.
Thank you for reading.
A work by Mathias Steilen